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Abstract 

Heritable quantitative characters underline complex genetic traits. However, a single quantitative phenotype may 
not be a suitably good surrogate for a clinical end point trait. 

It may be more optimal to use a multivariate phenotype vector correlated with the end point trait to carry out an 
association analysis. Existing methods, such as variance components and principal components, suffer from 
inherent limitations, such as lack of robustness or difficulty in biological interpretation of association findings. In an 
effort to circumvent these limitations, we propose a novel regression approach based on a conditional binomial 
model to detect association between a single-nucleotide polymorphism and a multivariate phenotype vector. We 
use our proposed method to analyze data on systolic and diastolic blood pressure levels provided in Genetic 
Analysis Workshop 18. We find that the bivariate analysis of the two phenotypes yields more promising results in 
terms of lower p-values compared to univariate analyses. 



Background 

Most clinical end point traits are governed by a set of 
quantitative precursors and a single precursor is unlikely 
to explain a significant amount of the variation in the end 
point trait. Thus, it may be a prudent strategy to analyze a 
multivariate vector of phenotypes for association mapping 
of a clinical end point trait. The major statistical challenge 
in the analyses of multivariate phenotypes lies in the mod- 
eling of the vector of phenotypes. Univariate analyses of 
the constituent phenotypes may lead to the statistical pro- 
blem of multiple comparisons [1]. Likelihood-based meth- 
ods, such as variance components [2,3], are susceptible to 
the choice of the joint probability distribution of the com- 
ponents of the vector. An alternative approach that cir- 
cumvents the problem of modeling the multivariate 
phenotype is to obtain a reduced univariate phenotype 
using principal components [4]. However, association 
results based on principal components may be difficult to 
interpret biologically. We propose a novel binomial regres- 
sion approach that models the likelihood of the number 
of copies of the minor allele at a single-nucleotide 
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polymorphism (SNP) conditional on the vector of multi- 
variate phenotype. We apply our proposed method to ana- 
lyze systolic and diastolic blood pressure levels on 
unrelated individuals using longitudinal data over four 
time points provided in Genetic Analysis Workshop 18 
(GAW18). We compare the association results based on a 
bivariate analysis of the two phenotypes with those based 
on univariate analyses. 

Data description 

For our analyses, we use data on systolic blood pressure 
(SBP) levels and diastolic blood pressure (DBP) levels on 
157 unrelated individuals along with their genotypes at 
all the available 456,752 variant sites distributed over 11 
autosomal chromosomes in the genome-wide association 
studies data provided in GAW18. In addition to age, we 
use smoking status and medication indicator (both 
defined as binary variables) at each time point of exami- 
nation as covariates, as these factors could be potential 
confounders in the association analyses. Both the SBP 
and the DBP levels are adjusted for these covariates for 
each time point and the tests for association are per- 
formed on the adjusted phenotypes. The adjustment and 
the association analyses are based on 139 individuals, as 
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data on all the covariates are not available for the remain- 
ing individuals. The percentages of individuals with miss- 
ing phenotype data at the different time points are 8%, 
35%, 35%, and 75%, respectively. On tabulating the per- 
centages of individuals with phenotype data missing at 
multiple time points, we do not observe any pattern in 
the missing data. For our association analyses, we impute 
the missing data using a procedure outlined in the 
following section. 

Methods 

Statistical methodology 

Imputation of missing phenotype values and covariate 
adjustment 

The assumption of multivariate normality provides a 
computationally elegant framework for the expectation 
maximization algorithm [5] to estimate parameters 
when data are missing. Blood pressure levels have tradi- 
tionally been believed to follow a lognormal distribution. 
While the Kolmogorov-Smirnov test did not show any 
significant departure from normality for the systolic and 
diastolic blood pressure levels at any of the time points, 
some of the /j-values are very close to the threshold of 
0.05. We thus use a logarithmic transformation on each 
of the phenotypes to induce normality. Suppose the vec- 
tor of log-transformed values of any of the two pheno- 
types at the 4 time points is represented as X = {Xi, X2, 
X3, X4). If Y denotes the vector comprising those com- 
ponents of X that are missing and Z comprising the 
components that are available for an individual, Y is 
estimated via an expectation maximization algorithm as 
the expectation of Y conditioned on Z and is given by 
|1y-£yzSzz^(Z-Hz)) where, (ly and [iz are the mean vec- 
tors of Y and Z, respectively; Zyzis the matrix of covar- 
iance between Y and Z, while £zz is the dispersion 
matrix of Z. At every time point, we perform a linear 
regression of the log-transformed values of each of the 
two phenotypes (available as well as imputed) on age, 
smoking status, and medication indicator based only on 
those observations for which data were available on all 
the covariates and thereby obtain the residuals of each 
of the regressions. 

Test for association using binomial regression 

The phenotypes for our association analyses are the 
adjusted SBP and DBP levels at each time point 
obtained using the algorithm described in the preceding 
section. We propose a binomial regression framework to 
test for association of a SNP with a multivariate pheno- 
type. Suppose X = {Xi, X2, X3,..., X]^) denotes a vector of 
k phenotypes and N denotes the number of copies of 
the minor allele (0, 1, or 2) at a SNP. We model the 
conditional distribution of N given X as binomial with 
parameters 2 and piX) where, piX) is a logistic link 
function given by: 



exp()8o + Ell P'Xj) 
1 + expifio + Eti PiXi) 

The test for association is equivalent to testing Hq: 
Pi =P2= — =Pk = 0 versus Hi: not Hq and the log-like- 
lihood ratio test statistic is distributed as chi-squares 
with k degrees of freedom under the null hypothesis. 
We compare the relative performances of 5 phenotype 
vectors in detecting association: (a) Ti: the adjusted 
SBP levels at the 4 time points; (b) T2: the adjusted 
DBP levels at the 4 time points; (c) T^: the adjusted 
SBP levels summarized by the first 2 principal compo- 
nents across the four time points; (d) T^: the adjusted 
DBP levels summarized by the first two principal com- 
ponents across the four time points; and (e) T^: the 
adjusted SBP as well as the adjusted DBP levels sum- 
marized by a bivariate phenotype comprising the first 
2 principal components corresponding to each of the 
phenotypes across the four time points. The above 
choice of the principal components is motivated by the 
fact that approximately 75% of the variation in each of 
the two phenotypes is explained by the corresponding 
first two principal components. Because the sample 
size is not very large, we consider only common var- 
iants for our association analyses and select only those 
SNPs with a minor allele frequency of at least 0.01. 
We also exclude those SNPs that provide significant 
evidence of departure from Hardy- Weinberg equili- 
brium (based on the Bonferroni corrected threshold at 
the genome-wide level). Thus, the tests for association 
based on the proposed binomial regression model are 
carried out on 426,193 SNPs. To correct for multiple 
testing, we use the false discovery rate (FDR) proce- 
dure [6] with an overall rate of 0.05. 

Results 

Among the five phenotype vectors considered, the one 
comprising the first two principal components of both 
SBP and DBP levels (Tg) provides the most promising 
association findings. Table 1 presents the five most sig- 
nificant association findings corresponding to each 
choice of phenotype vector. The only SNP that exhibits 
significant evidence of association at the genome-wide 
level is rsl2634258 on chromosome 3 (adjusted FDR of 
0.047) with T5. We observe that this SNP also ranks 
among the top five SNPs in terms of lowest unadjusted 
/^-values when (the first two principal components of 
the SBP levels at the four time points) is analyzed. 

Conclusions 

We have developed a binomial regression model that can 
incorporate multiple phenotypes for association analysis 
of the multivariate phenotype vector. The method has 
several advantages: it neither involves any modeling of 
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Table 1 Five most significant association findings using tlie different phenotype vectors 
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The figures in parentheses represent the unadjusted p-values and chromosomal locations, respectively. 



the correlation structure within the components of the 
multivariate phenotype as required in likelihood-based 
approaches nor does it compromise on the biological 
interpretation by transforming the multivariate pheno- 
type to principal components. 

The SNP rsl26342S8 that exhibits the most promising 
evidence of association with multiple definitions of the 
phenotype vector is located in the intergenic region 
between the genes PTPRG (receptor-type tyrosine-pro- 
tein phosphatase gamma) and FHIT (fragile histidine 
triad) on 3pl4.2. The RNA expression in PTPRG has 
been reported to be upregulated for arrhythmogenic 
right ventricular cardiomyopathy in humans [7]. On the 
other hand, the RNA expression of FHIT is downregu- 
lated in mouse for low blood pressure [8]. Although we 
obtain only one association finding that satisfies the 
genome-wide FDR threshold of 0.05, it is likely that the 
sample size is grossly inadequate to detect associations 
for a complex phenotype. 

While constructing the phenotype vectors, we had 
considered the first two principal components of the 
blood pressure levels across the four time points. As 
mentioned earlier, these two principal components 
explain approximately 75% of the total variation of 
blood pressure levels across the time points. Even 
though the first principal component explains approxi- 
mately 55% of the variation, the first 3 principal compo- 
nents explain approximately 90% of the variation. 
Although for brevity we do not present the association 
findings for the different choices of the principal com- 
ponents, we find that the SNP rs7644778 ranks among 
the top two association findings with the first principal 
component of each of SBP and DBP levels across the 
four time points, albeit not significant at the genome- 
wide level. We note that this SNP also exhibits evidence 
of association with first two principal components of 
each of the two phenotypes. The SNP is located in the 
gene TPRGl (tumor protein p63 regulated 1). The RNA 
expression of this gene has been reported to be upregu- 
lated for cardiomyopathy in humans. On the other 



hand, the five most significant findings with the first 
two principal components are identical to those with 
the first three principal components, but none of them 
is significant at the genome-wide level. This may be 
explained by the fact that the test with 3 principal com- 
ponents for each of the two phenotypes has two addi- 
tional degrees of freedom under the null hypothesis and 
may suffer from loss in power. 

Among the five phenotype vectors (Ti, T2, T^, T^ and 
T^), we find that T^ (ie, the phenotype that considers 
both SBP as well as DBP levels simultaneously) provides 
the most significant association findings (in terms of 
p-value). Even though the empirical findings seem to 
suggest that association analyses based on a multivariate 
phenotype comprising traits modulated by common 
genetic variants are likely to be more powerful than uni- 
variate analyses of the constituent traits of the multivari- 
ate phenotype vector, such a hypothesis can only be 
validated using extensive simulations. We have per- 
formed some independent simulations on 500 unrelated 
individuals under a few pleiotropic bivariate phenotype 
models: (a) the constituent traits have a bivariate normal 
distribution, (b) both are distributed as chi-squares, and 
(c) 1 trait has a normal distribution, while the other is bin- 
ary. Our preliminary results based on 1000 replications 
indicate that our proposed method maintains the correct 
false-positive rates under all the phenotype models. We 
also find that simultaneous analyses of both the pheno- 
types as covariates consistently yield more power com- 
pared to analyzing each univariate trait using the binomial 
regression. On the other hand, the power of the proposed 
method based on both phenotypes is higher, similar, or 
marginally lower compared to that based on the first prin- 
cipal component, depending on whether the correlation 
between the traits is low (less than 0.25), medium 
(between 0.25 and 0.5) or high (greater than 0.5). The 
above phenomenon was observed irrespective of the 
bivariate phenotype model considered. When the gene 
effects are in different directions for the two phenotypes 
(the minor homozygous genotype corresponds to high 
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values of 1 phenotype, but low values of the other pheno- 
type), the first principal component performs very poorly 
and may yield lower power compared to analyzing univari- 
ate phenotypes. However, further simulations based on 
higher dimensional phenotype vectors need to carried out 
to obtain greater insights into the relative advantages of 
the proposed method using the multivariate phenotype 
vector over data reduction approaches such as using a sub- 
set of the principal components of the phenotype vector. 



8. Puig 0, Wang IM, Cheng P, Zhou P, Roy S, Cully D, Peters M, Benita Y, 
Thompson J, Cai TQ: Transcriptome profiling and networl< analysis of 
genetically hypertensive mice identifies potential pharmacological 
targets of hypertension. Physiol Genomics 2010, 42:24-32. 

doi:l 0.1 1 86/1 753-6561 -8-Sl -S74 

Cite this article as: Majumdar et al:. Association mapping of blood 
pressure levels in a longitudinal framework using binomial regression. 

SMC Proceedings 2014 8{Suppl 1):S74. 



Competing interests 

The authors declare that they have no competing interests. 
Authiors' contributions 

SG and AM developed the proposed method. AM and IM wrote the 
computer codes and performed the data analyses. AM participated in the 
compilation and interpretation of the results. All authors read and approved 
the final manuscript. 

Acl<nowledgements 

This work was partially supported by the Council of Scientific and Industrial 
Research (CSIR) fellowship 09/093(011 2)/2008-EMR-l to Arunabha Majumdar. 
The GAW18 whole genome sequence data were provided by the T2D- 
GENES Consortium, which is supported by NIH grants UOl DK085524, UOl 
DK085584, UOl DK085501, UOl DK085526, and UOl DK085545. The other 
genetic and phenotypic data for GAW18 were provided by the San Antonio 
Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, 
which are supported by NIH grants POl HL045222, ROl DK047482, and ROl 
DK053889. The Genetic Analysis Workshop is supported by NIH grant ROl 
GM031575. The GAW18 whole genome sequence data were provided by 
the T2D-GENES Consortium which is supported by NIH grants UOl 
DK085524, UOl DK085584, UOl DK085501, UOl DK085526, and UOl 
DK085545. The other genetic and phenotypic data for GAW18 were 
provided by the San Antonio Family Heart Study and San Antonio Family 
Diabetes/Gallbladder Study, which are supported by NIH grants POl 
HL045222, ROl DK047482, and ROl DK053889. The Genetic Analysis 
Workshop is supported by NIH grant ROl GM031575. 
This article has been published as part of BMC Proceedings Volume 8 
Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the 
supplement are available online at http://www.biomedcentral.com/bmcproc/ 
supplements/8/Sl . Publication charges for this supplement were funded by 
the Texas Biomedical Research Institute. 



Published: 17 June 2014 



References 

1. Yang Q, Wu H, Guo CY, Fox CS: Analyze multivariate phenotypes in 
genetic association studies by combining univariate association tests. 

Genet Epidemiol 2010, 34:434-454. 

2. Almasy L, Blangero J: Variance components methods for analysis of 
complex phenotypes. Cold Spring Harb Protoc 2010, 5:77. 

3. Minica CC, Boosma Dl, van der Sluis S, Dolan CV: Genetic association in 
multivariate phenotypic data: power in five models. Twin Res Hum Genet 
2010, 13:525-543. 

4. Elston RC, Buxbaum S, Jacobs KB, Olson JM: Haseman and Elston revisted. 
Genef Epidemiol 2000, 19:1-17. 

5. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete 
data via the EIM Algorithm. J R Stat Soc Series B Stat Methodol 1977, 
39:1-38. 

6. Benjamin! Y, Hochberg Y: Controlling the false discovery rate: a practical 
and powerful approach to multiple test. J R Stat Soc Series B Stat 
Mettiodol 1995, 57(1):289-300. 

7. Gaertner A, Schwientek P, Ellinghaus P, Summer H, Golz S, Kassner A, 
Schuiz U, Gummert J, Milting H: Myocardial transcriptome analysis of 
human arrhythmogenic right ventricular cardiomyopathy. Physiol 
Genomics 2012, 44:99-109. 



Submit your next manuscript to BiolVled Central 
and tal<e full advantage of: 

• Convenient online submission 

• Thiorough peer review 

• No space constraints or color figure cliarges 

• Immediate publication on acceptance 

• Inclusion in PublVled, CAS, Scopus and Google Scholar 

• Researchi whiichi is freely available for redistribution 



Submit your manuscript at 
www.biomedcentral.com/submit 



BioMed Central 



